ABSTRACT

Procedures for determining cutting scores have been
proposed by Angoff and by Nedelsky. Nedelsky's approach requires that
a rater examine each distractor within a test item to determine the
probability of a minimally competent examinee answering correctly,
whereas Angoff uses a judgment based on the whole item, rather than
each of its components. The reliability of these approaches depends
upon the extent to which raters agree in their judgments.
Generalizability theory was used to quantify the magnitude of error
variance in each procedure, to compare data resulting from each
procedure, and to examine the impact of rater disagreement on test
reliability. Five subject experts rated the probability of answering
correctly for a total of 126 four-option items in a health-related
area. Both procedures were used by the same raters. Cutting score was
assumed to be the observed mean (probability) over raters and items.
In this sense, the expected variability of the observed mean was
error variance attributable to the procedure used. Results indicated
that both the cutting scores and their expected variance were
considerably different for the two procedures, and suggested that
differences between the procedures may be of greater consequence than
their apparent similarities. (GDC)
Abstract



Nedelsky and Angoff have suggested procedures for establishing a cutting 
score based on raters' judgments about the likely performance of minimally 
competent examinees on each item in a test. In this paper, generalizability
theory is used to characterize and quantify expected variance in cutting scores 
resulting from each procedure. Data for a 126-item test are used to illustrate 
this approach and to compare the two procedures. Finally, consideration is 
given to the impact of rater disagreement on some issues of measurement relia- 
bility or dependability. Results suggest that the differences between the 
Nedelsky and Angoff procedures may be of greater consequence than their apparent 
similarities. 
Introduction

Currently, there is considerable debate concerning the setting of passing
standards when scores on tests are used to make decisions regarding minimal
competency, proficiency, licensure, certification, and the award of credit
(see, for example, NCME, 1978). Meskauskas (1976), Buck (1977), and Zieky
and Livingston (1977), among others, have reviewed current procedures for
establishing cutting scores. For the most part, these procedures can be
grouped into two categories: procedures based on subjective judgments of
subject matter specialists, and procedures that use examinee scores on the
test itself and/or some criterion measure. The latter procedures are not
discussed in this paper; rather, primary emphasis is placed upon studying
two procedures suggested by Nedelsky (1954) and Angoff (1971) for establishing
cutting scores based upon the judgments of subject matter experts.

Nedelsky and Angoff Procedures 

Both of these procedures require judgments by raters concerning the
performance of hypothetical minimally competent examinees on each item of a test.
The approach described by Nedelsky (1954) requires that a rater examine each
distractor within an item to determine the probability that a minimally
competent examinee would answer that question correctly; whereas the approach
described by Angoff (1971) makes use of a judgment based on the whole item
rather than its individual components.

Using Nedelsky's procedure, raters are asked to identify, for each item,
those distractors that a minimally competent examinee would eliminate as
incorrect. The reciprocal of the number of remaining alternatives (including
the correct answer) serves as an estimate of the probability that a minimally
competent examinee would get the item correct; and the mean of these item
probabilities, over items and raters, is defined as the cutting score for
the test (in terms of proportion of items correct). In Angoff's procedure,
raters simply provide an estimate of the item probabilities without
specifically identifying which distractors a minimally competent examinee would
eliminate. Again, the mean of these item probabilities, over items and
raters, is defined as the cutting score for the test. Notationally, throughout
this paper, we use $\bar{X}$ to denote the cutting score, or more specifically,
the mean cutting score that results from a particular study. For a particular
rater, r, the mean of that rater's item probabilities will be denoted $\bar{X}_r$,
which can be interpreted as the cutting score that would be assigned by that
particular rater.
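
The two scoring rules just described can be illustrated with a minimal sketch. The ratings below are hypothetical (they are not the study data), and the function names are introduced here only for illustration.

```python
from statistics import mean


def nedelsky_probability(n_alternatives, n_eliminated):
    """Reciprocal of the alternatives that remain after elimination."""
    remaining = n_alternatives - n_eliminated  # still includes the correct answer
    if remaining < 1:
        raise ValueError("at least the correct answer must remain")
    return 1.0 / remaining


def cutting_score(probabilities):
    """Return (mean over raters and items, list of rater means).

    probabilities holds one row per rater and one value per item.
    """
    rater_means = [mean(row) for row in probabilities]
    return mean(rater_means), rater_means


# Hypothetical data: 2 raters x 3 four-alternative items.
eliminated = [[0, 1, 2], [1, 1, 3]]  # distractors eliminated (Nedelsky)
nedelsky = [[nedelsky_probability(4, k) for k in row] for row in eliminated]
angoff = [[0.50, 0.60, 0.70], [0.40, 0.80, 0.90]]  # direct judgments (Angoff)

nedelsky_cut, nedelsky_rater_means = cutting_score(nedelsky)
angoff_cut, angoff_rater_means = cutting_score(angoff)
print(round(nedelsky_cut, 4), round(angoff_cut, 4))
```

With equal numbers of items per rater, the mean of the rater means equals the mean over all rater-item probabilities, so either can serve as the cutting score.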

Issues, Approach, and Data Sets

The Nedelsky and Angoff procedures are appealing in many contexts because
they are understandable to raters and test users; and these procedures force
raters to give detailed consideration to the specific content of a test, rather
than to its general characteristics. However, the validity and practical utility
of these approaches, and similar approaches, may rest heavily upon the extent
to which raters agree in their judgments. This issue has received very little
attention in the context of establishing cutting scores, although Andrew and
Hecht (1976) do address some aspects of this issue. The principal purposes of
this paper are: (a) to identify a psychometric approach for characterizing and
quantifying the magnitude of error variances (in either cutting score
procedure) attributable to disagreement evident in rater judgments; (b) to apply
this approach to data resulting from the Nedelsky and Angoff procedures, and
thereby compare the two procedures; and (c) to examine the impact of rater
disagreement on some issues relating to the reliability or dependability of
measurement.

The principal psychometric approach employed to address these issues is 
based upon generalizability theory, although some aspects of these issues are 
addressed in more traditional ways. Generalizability theory (see Cronbach, 
Gleser, Nanda, and Rajaratnam, 1972) is especially appropriate here because 
it allows us to differentiate among multiple sources of error in a systematic 
manner. In the body of this paper we introduce and explain concepts and
equations from generalizability theory, as needed; but we do not usually
prove results. Readers desiring more detail are referred to Cronbach et al.
(1972) and/or Brennan (1977). It should be noted that there are many aspects
of generalizability theory that do not concern us in this paper. For example,
we never report a generalizability coefficient and in only one instance do we
refer to a universe score variance, as this term is defined in generalizability
theory. Indeed, the approach used here is essentially variance components 
analysis viewed from the perspective of generalizability theory. 

We employ three data sets to illustrate our approach to issues of rater
disagreement and to compare the Nedelsky and Angoff procedures. Data set 1
consists of the Angoff probabilities assigned by five raters to each of the
126 four-alternative items constituting a test in a health-related area. Data
set 2 consists of the (inferred) Nedelsky probabilities assigned by the same
raters to the same items; and data set 3 consists of the eliminated distractors,
for each item and rater, that form the basis for inferring the Nedelsky
probabilities in data set 2. Each of the five raters is a practitioner or teacher
in the appropriate field; and their ratings were provided independently. At
the conclusion of the study, raters were instructed to discuss each item and
provide a consensus Angoff-type judgment. These reconciled judgments are
examined, although not extensively, in a separate analysis.

Both the Nedelsky and Angoff procedures necessitate judgments about
"minimum competence." In one section of this paper, we briefly consider some
aspects of how the two procedures allow a rater to operationalize some
conception of minimum competence. Otherwise, however, this paper is not intended
to treat educational, philosophical, or psychological issues associated with
defining minimum competence. Also, throughout this paper, except in one
section, we restrict ourselves to consideration of $\bar{X}$ as a cutting score.
Finally, we recognize that in realistic settings evaluators sometimes use
more than one cutting score procedure, or a variant of the procedures
discussed here. This paper is not intended to address such issues in any detail.
The r x i Design and the Angoff Procedure

For the Angoff procedure, the probability assigned by rater r to item
i can be represented as

$$X_{ri} = \mu + \mu_r^{\sim} + \mu_i^{\sim} + \mu_{ri}^{\sim} \qquad (1)$$

where $\mu$ = grand mean for the population of raters and the universe
of items,

$\mu_r^{\sim}$ = effect for rater r,

$\mu_i^{\sim}$ = effect for item i, and

$\mu_{ri}^{\sim}$ = effect for the interaction of rater r and item i.

(Technically, since we have only one observation for each rater-item combination,
the effect $\mu_{ri}^{\sim}$ is completely confounded with any other sources of variation,
sometimes called "random" or "experimental" error.) Here, unless otherwise
noted, we will assume that the actual raters in the study can be considered a
random sample from an essentially infinite population of raters; and, that the
actual items can be considered a random sample from an essentially infinite
universe of items. Under this assumption, and assuming independent effects
that sum to zero, Equation 1 represents what is usually called a random effects
model for the r x i design.

Given this model, for rater r, the average probability over the universe
of items is

$$\mu_r = \mu + \mu_r^{\sim}$$

whereas the average probability over the sample of items is $\bar{X}_r$. Similarly,
for item i, the average probability over the population of raters is

$$\mu_i = \mu + \mu_i^{\sim}$$

and the average probability over the sample of $n_r$ raters is $\bar{X}_i$.
Sample Statistics 

In terms of sample statistics, Table 1 reports means, standard deviations,
and intercorrelations among raters, for data set 1 and the Angoff procedure.
In Table 1 and subsequent tables, all results, except those within parentheses,
are in terms of probabilities. Results within parentheses are in terms of
number of items. For example, Table 1 reports that the mean probability over
the $n_r = 5$ raters and $n_i = 126$ items in this study is $\bar{X} = 0.6632$. In effect,
this average probability is the (mean) cutting score, in terms of proportion of
items correct, arrived at using the Angoff procedure. In terms of number of
items correct, the (mean) cutting score is $n_i \bar{X} = 83.56$, as reported in Table 1.
Also, Table 1 reports that the standard deviation of the rater mean
probabilities is 0.0373. This is the standard deviation of the cutting scores for the
five raters, in terms of proportions of items correct. The corresponding
standard deviation in terms of number of items correct is 4.70.
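
The conversion between the two reporting metrics is a rescaling by the number of items. The short sketch below reproduces the parenthesized values quoted above from the reported proportions; the helper name is an illustrative assumption.

```python
N_ITEMS = 126  # number of items in the test, as stated in the text


def to_items(proportion, n_items=N_ITEMS):
    """Express a proportion-correct quantity in number-of-items units."""
    return proportion * n_items


mean_cut_items = to_items(0.6632)  # Angoff mean cutting score
sd_items = to_items(0.0373)        # standard deviation of the rater means
print(f"{mean_cut_items:.2f} {sd_items:.2f}")
```

Because a standard deviation scales linearly with the score metric, the same multiplication applies to both the mean and the standard deviation (a variance would scale by the square of the number of items).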



Insert Table 1 about here 
We will examine the results reported in Table 1 in somewhat more detail
later. Here, we simply note that Table 1 suggests that there is some degree
of variability among rater means, as reflected by $\sigma(\bar{X}_r)$; there is some degree
of variability within each rater, as reflected by $\sigma(X_{ri})$; and there is some
degree of variability in the rater intercorrelations. The sample statistics
in Table 1, however, do not indicate clearly the variability in the mean
cutting score, $\bar{X}$, which is a principal concern of this paper. In other
words, we would like some estimate of the variance (or standard deviation)
of $\bar{X}$ if the entire study were replicated with different samples of raters
and/or items. To obtain such estimates we employ generalizability theory.

Generalizability Theory

Given the random effects model in Equation 1, Table 2 reports equations
for estimating the variance components associated with each of the score
effects in the model. For example, $\hat{\sigma}^2(r)$ is an unbiased estimate of the
variance of $\mu_r$ (or $\mu_r^{\sim}$) over the population of raters. (Recall that $\mu_r$ is the
expected value, over the universe of items, of the probabilities assigned by
rater r.) Similarly, $\hat{\sigma}^2(i)$ is an unbiased estimate of the variance of $\mu_i$
(or $\mu_i^{\sim}$) over the universe of items.

It is important that $\hat{\sigma}^2(r)$ be differentiated from $\hat{\sigma}^2(\bar{X}_r)$. The former is
an estimate of the variance, over the population of raters, of the scores (or
probabilities) $\mu_r$; while the latter is the variance, over the sample of raters,
of the scores (or probabilities) $\bar{X}_r$. In terms of the random effects variance
components in Table 2,

$$\hat{\sigma}^2(\bar{X}_r) = \hat{\sigma}^2(r) + \hat{\sigma}^2(ri)/n_i . \qquad (2)$$

In other words, the observed variance of rater means over the $n_i = 126$ items
can be decomposed into two parts: one part that is uniquely associated with
raters, and another part that is associated with the interaction of raters
and items.
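
As an illustration of this decomposition, the sketch below estimates the three variance components for a small crossed raters-by-items table and checks the relationship in Equation 2. Table 2 is not reproduced in this section, so the estimators used are the standard ANOVA estimators for a random effects two-way design with one observation per cell, assumed here to correspond to those in Table 2. The ratings are hypothetical.

```python
from statistics import mean, variance


def variance_components(x):
    """Estimate random effects variance components for a crossed r x i table.

    x holds one row per rater and one value per item.
    """
    n_r, n_i = len(x), len(x[0])
    grand = mean(v for row in x for v in row)
    rater_means = [mean(row) for row in x]
    item_means = [mean(col) for col in zip(*x)]

    # Sums of squares for raters, items, and the residual (interaction) term.
    ss_r = n_i * sum((m - grand) ** 2 for m in rater_means)
    ss_i = n_r * sum((m - grand) ** 2 for m in item_means)
    ss_total = sum((v - grand) ** 2 for row in x for v in row)
    ss_ri = ss_total - ss_r - ss_i

    ms_r = ss_r / (n_r - 1)
    ms_i = ss_i / (n_i - 1)
    ms_ri = ss_ri / ((n_r - 1) * (n_i - 1))

    return {
        "r": (ms_r - ms_ri) / n_i,   # raters
        "i": (ms_i - ms_ri) / n_r,   # items
        "ri": ms_ri,                 # interaction, confounded with error
        "rater_means": rater_means,
    }


# Hypothetical probabilities: 3 raters x 4 items.
ratings = [
    [0.50, 0.70, 0.90, 0.60],
    [0.40, 0.80, 0.70, 0.50],
    [0.60, 0.60, 1.00, 0.80],
]
vc = variance_components(ratings)
observed = variance(vc["rater_means"])          # divisor n_r - 1
decomposed = vc["r"] + vc["ri"] / len(ratings[0])
print(abs(observed - decomposed) < 1e-12)
```

The equality holds when the variance of the rater means is computed with the divisor $n_r - 1$, since that variance is then the rater mean square divided by the number of items.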



Insert Table 2 about here



Table 2 also reports three equations for estimating the expected variance
of $\bar{X}$. Each of these equations is expressed solely in terms of random effects
variance components and sample sizes. The sample sizes in these equations are
identified with primes to distinguish them from the sample sizes that
characterize the actual data available. We say that $n$ represents a G study sample size
and $n'$ represents a D study sample size. In the body of this paper, unless
otherwise noted, we will assume that the G study and D study sample sizes are
equal. In an Appendix we provide a more detailed consideration of distinctions
between G studies and D studies for the r x i design.

Equation 3 in Table 2 provides the expected value of the variance of the
mean cutting score, $\bar{X}$, for generalizing over samples of $n'_r$ raters and $n'_i$ items.
We can conceive of the possibility of determining $\bar{X}$ a "very large" number of
times, each time using a different sample of $n'_r$ raters and $n'_i$ items. Equation
3 estimates the variance of the distribution of the "very large" number of means
that would result from such replications.
It is in this sense that we say $\hat{\sigma}^2(\bar{X})$ is the variance of the mean for
generalizing over both samples of raters and samples of items.

Equation 4 in Table 2 is the expected variance of $\bar{X}$ generalizing over
samples of $n'_i$ items, for a fixed set of $n'_r$ raters. We denote this variance
$\hat{\sigma}^2(\bar{X}|R^*)$ to emphasize that raters are considered fixed. Again, we can conceive
of the possibility of determining $\bar{X}$ a "very large" number of times, each time
using a different sample of $n'_i$ items but the same $n'_r$ raters. $\hat{\sigma}^2(\bar{X}|R^*)$ is an
unbiased estimate of the variance of this distribution of means. Similarly,
$\hat{\sigma}^2(\bar{X}|I^*)$ in Equation 5 is the expected variance of $\bar{X}$ generalizing over samples
of $n'_r$ raters, for a fixed set of $n'_i$ items.

In brief, $\hat{\sigma}^2(\bar{X}|I^*)$ is for generalizing over samples of raters, $\hat{\sigma}^2(\bar{X}|R^*)$
is for generalizing over samples of items, and $\hat{\sigma}^2(\bar{X})$ is for generalizing over
samples of both raters and items. These then are three different estimates of
error variance in the mean cutting score. Which of these estimates is appropriate
can be determined only in the context of a specific study; i.e., it is the
decision-maker who must determine whether it is appropriate to generalize over
samples of raters, items, or both. It is evident from Equations 3 to 5, however,
that $\hat{\sigma}^2(\bar{X})$ must be at least as large as $\hat{\sigma}^2(\bar{X}|R^*)$ and $\hat{\sigma}^2(\bar{X}|I^*)$. This follows from
the fact that $\hat{\sigma}^2(\bar{X}|R^*)$ does not involve variability due to raters, $\hat{\sigma}^2(r)$, and
$\hat{\sigma}^2(\bar{X}|I^*)$ does not involve variability due to items, $\hat{\sigma}^2(i)$.



Generalizability Results for Angoff Procedure 

For data set 1 and the Angoff procedure, Table 3 reports the usual ANOVA
results, estimated random effects variance components, and estimates of mean
cutting score variability. As reflected by the above discussion, it is usual
in generalizability theory to report results in terms of variances; however,
in Table 3 we also report the three estimates of mean score variability in
terms of standard deviations to facilitate interpretation. We note, for
example, that in terms of proportion of items, the standard deviation of $\bar{X}$,
for generalizing over raters and items, is 0.0182; and in terms of number of
items, it is 2.29. Furthermore, $\hat{\sigma}(\bar{X})$ and $\hat{\sigma}(\bar{X}|I^*)$ have approximately the same
magnitude; and both of them are almost twice as large as $\hat{\sigma}(\bar{X}|R^*)$. Clearly,
for these data, the decision concerning whether or not to generalize over
raters is an important determiner of the magnitude of the standard deviation
of $\bar{X}$.



Insert Table 3 about here 



The results reported in Table 3 are based on the assumption that the
G and D study sample sizes are the same; i.e., $n_r = n'_r = 5$ and $n_i = n'_i = 126$.
The equations in Table 2 can be used, however, to determine the expected
variability of $\bar{X}$ for different numbers of raters and/or items. For example,
the reader can verify that, if the number of raters were doubled to $n'_r = 10$
and the number of items remained unchanged, then the values of $\hat{\sigma}(\bar{X})$, $\hat{\sigma}(\bar{X}|R^*)$,
and $\hat{\sigma}(\bar{X}|I^*)$ would be 0.0138 (1.74), 0.0084 (1.05), and 0.0130 (1.64), respectively.
As predicted by the equations in Table 2, increasing the number of raters on
which $\bar{X}$ is based decreases the expected variability of the distribution of
mean scores.
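
The dependence of the three error variances on the D study sample sizes can be sketched as follows. Equations 3 to 5 appear in Table 2, which is not reproduced in this section; the forms used below are the standard expressions for the r x i design and are assumed to match them, consistent with the description that the raters-fixed variance omits $\hat{\sigma}^2(r)$ and the items-fixed variance omits $\hat{\sigma}^2(i)$. The variance component values are hypothetical, not those of Table 3.

```python
def mean_score_variances(var_r, var_i, var_ri, n_r, n_i):
    """Expected variance of the mean cutting score for D study sizes n_r, n_i.

    Assumed forms of Equations 3 to 5 (standard for the crossed r x i design).
    """
    interaction = var_ri / (n_r * n_i)
    return {
        "both": var_r / n_r + var_i / n_i + interaction,   # Equation 3
        "raters_fixed": var_i / n_i + interaction,         # Equation 4
        "items_fixed": var_r / n_r + interaction,          # Equation 5
    }


# Hypothetical variance components, for illustration only.
components = dict(var_r=0.0010, var_i=0.0060, var_ri=0.0300)

five = mean_score_variances(**components, n_r=5, n_i=126)
ten = mean_score_variances(**components, n_r=10, n_i=126)

for name in five:
    # Report standard deviations, as in the text.
    print(name, round(five[name] ** 0.5, 4), round(ten[name] ** 0.5, 4))
```

Under these assumed forms, doubling the number of raters reduces all three variances, although the raters-fixed variance changes only through the interaction term.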
All of the above results depend upon one assumption, namely, that
whenever we generalize over a facet (raters or items), we assume that the facet
encompasses an essentially infinite number of observations. Sometimes
evaluators wish to generalize to a finite population of raters and/or a finite
universe of items. In this case, the equations in Table 2 are no longer
appropriate, and the Appendix provides two equivalent expressions for
estimating the expected variance of $\bar{X}$ for a population of raters of any size,
$N_r$, and a universe of items of any size, $N_i$.
The Nedelsky Procedure

As discussed in the introduction to this paper, there are both similarities
and differences between the Nedelsky and Angoff procedures. The two procedures
are similar in that, for each item and rater, they both provide a probability
that a minimally competent examinee will get an item correct. The Angoff
procedure directly elicits this probability from each rater, whereas the
Nedelsky procedure involves inferring this probability from the number of
distractors that a rater believes would be eliminated by a minimally competent
examinee. Here we consider both aspects of the Nedelsky procedure, beginning
with an analysis of the Nedelsky probabilities (data set 2), which parallels
our previous analysis of the Angoff procedure. Then we examine the eliminated
distractors for the Nedelsky procedure (data set 3).

Probabilities of Correct Responses 

Table 4 reports some sample statistics for the Nedelsky procedure based
upon data set 2, the Nedelsky probabilities of a correct response. We note
that the mean cutting score, $\bar{X}$, is 0.5563 (70.09) for the Nedelsky procedure;
whereas for the Angoff procedure, $\bar{X}$ is 0.6632 (83.56), as indicated in Table 1.
Clearly, there is a substantial difference in mean scores for the two procedures.
Furthermore, Tables 1 and 4 indicate that the standard deviation of the rater
means for the Nedelsky procedure is approximately double the corresponding
standard deviation for the Angoff procedure.



Insert Table 4 about here
Table 5 reports a generalizability analysis of the Nedelsky probabilities
based upon the same model, assumptions, and sample sizes used in presenting
the corresponding results for the Angoff procedure in Table 3. In comparing
the Nedelsky results in Table 5 with the Angoff results in Table 3, we note
that each of the random effects variance components [$\hat{\sigma}^2(r)$, $\hat{\sigma}^2(i)$, and $\hat{\sigma}^2(ri)$]
for the Nedelsky procedure is considerably larger than the corresponding
variance component for the Angoff procedure. This fact directly results in
larger estimates of $\hat{\sigma}(\bar{X})$, $\hat{\sigma}(\bar{X}|R^*)$, and $\hat{\sigma}(\bar{X}|I^*)$, for the Nedelsky procedure.
$\hat{\sigma}(\bar{X}|R^*)$, for generalizing over items, is approximately the same for the two
procedures. However, $\hat{\sigma}(\bar{X})$, for generalizing over both raters and items, is
twice as large for the Nedelsky procedure. A similar statement holds for
$\hat{\sigma}(\bar{X}|I^*)$, when generalization is over raters only. In a later section, we
examine these and other differences between the two procedures in more detail.



Insert Table 5 about here 



Eliminated Distractors 

One way of viewing the results presented thus far is that, in terms of
setting a single cutting score with the Nedelsky or Angoff procedure, $\hat{\sigma}^2(ri)$
is always a source of error, $\hat{\sigma}^2(r)$ is a source of error if generalization is
over raters, and $\hat{\sigma}^2(i)$ is a source of error if generalization is over items.
This statement is based upon the linear model in Equation 1 for the probability
assigned by a rater to an item. In the Nedelsky procedure, however, the data
that are actually collected are eliminated distractors, not probabilities;
even though the cutting score resulting from the Nedelsky procedure is based
directly upon probabilities. (Technically, the cutting score is a linear
function of the inferred probabilities, and a nonlinear function of the
eliminated distractors.)

Several interesting, but potentially confounding, issues arise when we 
consider the set of eliminated distractors for raters and items. One of these 
issues is discussed below, and other issues are treated later. For a given 
item, if two raters indicate that the same number of distractors could be 
eliminated, then the (inferred) probability for these two raters will be the 
same, whether or not the raters agree on which distractors could be eliminated. 
Technically, in terms of the way Nedelsky formulated his procedure, such
disagreement among raters has no bearing upon the cutting score that results from
the procedure. However, it seems reasonable to believe that one's confidence 
in the Nedelsky procedure, in a specific context, might be influenced by the 
extent to which raters agree not only with respect to the number of distractors 
eliminated, but also with respect to which distractors could be eliminated. 

To examine this issue, variance components can be estimated for a design
in which raters are crossed with items, and distractors, d, are nested within
items. We denote this design r x (d:i). Formulas for estimating variance
components for this design are presented in Table 6, along with the estimated
variance components for data set 3. It is usual in many applications of
generalizability theory to report random effects variance components, based
on the assumption that the population (or universe) size for each facet is
essentially infinite. In this case, however, it seems unreasonable to consider the
$n_d = 3$ distractors associated with each item as a sample from an essentially
infinite universe of possible distractors for the item. Therefore, in Table 6,
the variance components are reported under the assumption that distractors are
fixed, and this assumption is indicated by the notation $D^*$.



Insert Table 6 about here 



Let us concentrate on the two variance components in Table 6 that involve
variability attributable to distractors. The variance component $\hat{\sigma}^2(d:i|D^*)$
reflects the average, over items, of the variance attributable to the proportion
of raters who eliminate each distractor. The magnitude of $\hat{\sigma}^2(d:i|D^*)$ will be
large when, on the average, raters judge an item's distractors to vary in their
difficulty, or attractiveness, to examinees. By contrast, the magnitude of
$\hat{\sigma}^2(rd:i|D^*)$ reflects disagreement or variability among raters in their judgments
of distractor attractiveness for an item. To put it another way, the magnitude
of $\hat{\sigma}^2(rd:i|D^*)$ reflects the extent to which raters disagree in their judgments
about which distractors could be eliminated.

If we consider $\hat{\sigma}^2(d:i|D^*) = 0.0629$ as an estimate of "true" variability
among distractors, then our estimate of "error" for $n_r = 5$ raters is:

$$\hat{\sigma}^2(rd:i|D^*)/n_r = 0.1814/5 = 0.0363.$$

Evidently, the "error" variance (attributable to the differential attractiveness
of distractors for different raters) is almost fifty percent as large as the
"true" variance among distractors. This suggests that, for these data, even when
raters agree on the number of distractors that can be eliminated, there are
substantial differences among raters concerning which distractors can be eliminated.

Eliminating Correct Alternative 

In conducting a study with the Nedelsky procedure, it is usual to provide
complete items, including the correct alternatives, to each rater. If the correct
alternatives for all items are specified for the raters, then it is reasonable to
expect that no rater would eliminate the correct alternative for any item,
assuming, of course, that the items are well-constructed and the raters take their
task seriously.

On the other hand, if the correct answer is not specified for the raters, 
then perhaps some raters will eliminate the correct alternative for some items. 
This is indeed what happened in this study. Specifically, the numbers of correct 
answers eliminated by raters 1 to 5 were 11, 9, 26, 14, and 16, respectively. We 
found no evidence of clerical error, or mis-keyed items to explain these results, 
and we have no reason to question the extent to which raters took their task 
seriously. However, it is likely that individual raters had differing degrees
of familiarity with the content tested by specific items. 

When a rater indicates that the correct answer could be eliminated by a
minimally competent examinee, one could argue that the (inferred) probability
assigned by the rater to the item should be zero, no matter how many distractors
are eliminated by the rater. However, for the purposes of this study, we did
not adhere to this argument. Rather, we followed Nedelsky's procedure, as he
described it, and assigned probabilities on the basis of eliminated distractors
only. It is interesting to note that if we had assigned a probability of zero
whenever a rater eliminated the correct alternative, $\bar{X}$ would decrease and
estimates of variability would increase.
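
The two scoring conventions discussed here can be contrasted with a small sketch. The judgments are hypothetical, and the `zero_rule` flag is a name introduced only for this illustration; the default branch follows the rule as described in the text (probabilities based on eliminated distractors only).

```python
from statistics import mean

N_ALTERNATIVES = 4  # four-alternative items, as in the study


def nedelsky_probability(distractors_eliminated, correct_eliminated, zero_rule=False):
    """Inferred probability for one rater and one item.

    With zero_rule=True, an item whose correct answer was eliminated scores 0.
    Otherwise only the eliminated distractors are used.
    """
    if zero_rule and correct_eliminated:
        return 0.0
    return 1.0 / (N_ALTERNATIVES - distractors_eliminated)


# Hypothetical judgments for one rater:
# (number of distractors eliminated, was the correct answer eliminated?)
judgments = [(1, False), (2, True), (0, False), (3, False)]

as_described = mean(nedelsky_probability(d, c) for d, c in judgments)
with_zero_rule = mean(
    nedelsky_probability(d, c, zero_rule=True) for d, c in judgments
)
print(round(as_described, 4), round(with_zero_rule, 4))
```

The sketch shows only the effect on the mean: any item affected by the zero rule can only lower a rater's mean probability. It does not attempt to reproduce the effect on the variability estimates.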
A Comparison of the Two Procedures

Since the Nedelsky and Angoff procedures were both applied to the same items
by the same raters, the data from these two procedures (data sets 1 and 2) can
be analyzed jointly in a single design. Specifically, the appropriate analysis
involves the p x r x i design, in which the two procedures (p) are crossed with
both raters and items. Table 7 provides equations for estimating the variance
components for this design, and Table 8 provides the numerical values of these
estimated variance components for our data.



Insert Tables 7 and 8 about here 



The variance components identified as $\hat{\sigma}^2(\alpha)$ in Table 8 are obtained by
letting $N_p$ approach infinity in Table 7; and these are called random effects
variance components. The variance components identified as $\hat{\sigma}^2(\alpha|P^*)$ in Table 8
are obtained by letting $n_p = N_p$ in Table 7; and these variance components are
based on the assumption that procedures are fixed. The variance components
$\hat{\sigma}^2(\alpha|P^*)$ are appropriate when we restrict our interest to the actual procedures
in our study. Strictly speaking, here, the variance components $\hat{\sigma}^2(\alpha|P^*)$ seem
more appropriate than the random effects variance components, $\hat{\sigma}^2(\alpha)$, because it
seems difficult to consider these two procedures as a sample from some very large
set of similar cutting score procedures. However, the random effects variance
components are very useful in illustrating relationships between results for the
p x r x i design and the two r x i designs discussed previously.
Tables 7 and 8 also provide equations and numerical values for estimates of
the variability of $\bar{X}$, where $\bar{X}$ is, in this case, the mean over raters, items, and
procedures. For example, Table 8 reports that $\bar{X}$ (over procedures) is 0.6097 (76.82),
which is the mean of the $\bar{X}$'s reported in Tables 1 and 4.

The reader should note, however, that the estimates of the variability of
$\bar{X}$ in Table 8 are not averages of the corresponding estimates in Tables 3 and 5.
For example, $\hat{\sigma}(\bar{X}|P^*) = 0.0195$ (2.46), which is similar to $\hat{\sigma}(\bar{X}) = 0.0182$ (2.29) in
Table 3 for the Angoff procedure, but quite different from $\hat{\sigma}(\bar{X}) = 0.0336$ (4.24)
in Table 5 for the Nedelsky procedure. This pattern of results also holds for
$\hat{\sigma}(\bar{X}|P^*,R^*)$ and $\hat{\sigma}(\bar{X}|P^*,I^*)$. One inference that might be drawn from these observations
is that there would be no particular advantage in actually setting a cutting score
by averaging $\bar{X}$ from both procedures, assuming one is primarily interested in
minimizing the variability of $\bar{X}$.

Perhaps the most interesting result in Table 8 is that the variance
components that contain p are relatively large, indicating that there are substantial
differences between the two procedures and the probabilities that result from them.
For example, $\hat{\sigma}^2(p|P^*)$ is about four times larger than $\hat{\sigma}^2(r|P^*)$, suggesting that
there is considerably more variability attributable to differences in procedure
means than to differences in rater means (over procedures). From another
perspective, it can be shown that the observed variance in the two procedure means is:

$$\hat{\sigma}^2(\bar{X}_p) = \hat{\sigma}^2(p) + \frac{\hat{\sigma}^2(pr)}{n_r} + \frac{\hat{\sigma}^2(pi)}{n_i} + \frac{\hat{\sigma}^2(pri)}{n_r n_i} .$$

In other words, the variance components that contain p contribute directly to the
disparity we have identified in the procedure means.
The results reported in Table 8 are based upon the same data that led
to the results in Tables 3 and 5, for the two procedures separately. It
might seem, therefore, that there ought to be some relationships between the
variance components in Table 8 and those in Tables 3 and 5. This is indeed
the case. The reader can verify that the average of the variance components
for raters in Tables 3 and 5 is:

$$[\hat{\sigma}_1^2(r) + \hat{\sigma}_2^2(r)]/2 = \hat{\sigma}^2(r) + \hat{\sigma}^2(pr);$$

where variance components to the right of the equality are for the p x r x i
design in Table 8, $\hat{\sigma}_1^2(r)$ is the variance component for raters, for the Angoff
procedure in Table 3, and $\hat{\sigma}_2^2(r)$ is the variance component for raters, for the
Nedelsky procedure, in Table 5. Similarly,

$$[\hat{\sigma}_1^2(i) + \hat{\sigma}_2^2(i)]/2 = \hat{\sigma}^2(i) + \hat{\sigma}^2(pi), \text{ and}$$

$$[\hat{\sigma}_1^2(ri) + \hat{\sigma}_2^2(ri)]/2 = \hat{\sigma}^2(ri) + \hat{\sigma}^2(pri).$$

In effect, Table 8 crystallizes many of the differences between the two
procedures evident in comparing Table 3 with Table 5.
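
These three relationships can be checked numerically. The sketch below is an illustration under stated assumptions: Tables 2 and 7 are not reproduced in this section, so the estimators are the standard ANOVA estimators for balanced random effects designs with one observation per cell, and the two small rating tables are hypothetical.

```python
from itertools import product
from statistics import mean


def two_way(x):
    """Random effects variance components for one r x i table (rows = raters)."""
    n_r, n_i = len(x), len(x[0])
    g = mean(v for row in x for v in row)
    ss_r = n_i * sum((mean(row) - g) ** 2 for row in x)
    ss_i = n_r * sum((mean(col) - g) ** 2 for col in zip(*x))
    ss_ri = sum((v - g) ** 2 for row in x for v in row) - ss_r - ss_i
    ms_r, ms_i = ss_r / (n_r - 1), ss_i / (n_i - 1)
    ms_ri = ss_ri / ((n_r - 1) * (n_i - 1))
    return {"r": (ms_r - ms_ri) / n_i, "i": (ms_i - ms_ri) / n_r, "ri": ms_ri}


def three_way(x):
    """Random effects variance components for a p x r x i array x[p][r][i]."""
    n_p, n_r, n_i = len(x), len(x[0]), len(x[0][0])
    ps, rs, its = range(n_p), range(n_r), range(n_i)
    g = mean(x[p][r][i] for p, r, i in product(ps, rs, its))
    m_p = [mean(x[p][r][i] for r in rs for i in its) for p in ps]
    m_r = [mean(x[p][r][i] for p in ps for i in its) for r in rs]
    m_i = [mean(x[p][r][i] for p in ps for r in rs) for i in its]
    m_pr = [[mean(x[p][r]) for r in rs] for p in ps]
    m_pi = [[mean(x[p][r][i] for r in rs) for i in its] for p in ps]
    m_ri = [[mean(x[p][r][i] for p in ps) for i in its] for r in rs]

    ss_p = n_r * n_i * sum((m - g) ** 2 for m in m_p)
    ss_r = n_p * n_i * sum((m - g) ** 2 for m in m_r)
    ss_i = n_p * n_r * sum((m - g) ** 2 for m in m_i)
    ss_pr = n_i * sum((m_pr[p][r] - m_p[p] - m_r[r] + g) ** 2 for p in ps for r in rs)
    ss_pi = n_r * sum((m_pi[p][i] - m_p[p] - m_i[i] + g) ** 2 for p in ps for i in its)
    ss_ri = n_p * sum((m_ri[r][i] - m_r[r] - m_i[i] + g) ** 2 for r in rs for i in its)
    ss_total = sum((x[p][r][i] - g) ** 2 for p, r, i in product(ps, rs, its))
    ss_pri = ss_total - ss_p - ss_r - ss_i - ss_pr - ss_pi - ss_ri

    ms_r, ms_i = ss_r / (n_r - 1), ss_i / (n_i - 1)
    ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))
    ms_pi = ss_pi / ((n_p - 1) * (n_i - 1))
    ms_ri = ss_ri / ((n_r - 1) * (n_i - 1))
    ms_pri = ss_pri / ((n_p - 1) * (n_r - 1) * (n_i - 1))
    return {
        "r": (ms_r - ms_pr - ms_ri + ms_pri) / (n_p * n_i),
        "i": (ms_i - ms_pi - ms_ri + ms_pri) / (n_p * n_r),
        "pr": (ms_pr - ms_pri) / n_i,
        "pi": (ms_pi - ms_pri) / n_r,
        "ri": (ms_ri - ms_pri) / n_p,
        "pri": ms_pri,
    }


# Hypothetical probabilities: 3 raters x 4 items under each procedure.
angoff = [[0.50, 0.70, 0.90, 0.60], [0.40, 0.80, 0.70, 0.50], [0.60, 0.60, 1.00, 0.80]]
nedelsky = [[0.25, 0.50, 1.00, 0.33], [0.50, 0.50, 0.50, 0.25], [0.33, 1.00, 1.00, 0.50]]

angoff_vc, nedelsky_vc = two_way(angoff), two_way(nedelsky)
joint_vc = three_way([angoff, nedelsky])

for effect, with_p in (("r", "pr"), ("i", "pi"), ("ri", "pri")):
    average = (angoff_vc[effect] + nedelsky_vc[effect]) / 2
    print(effect, abs(average - (joint_vc[effect] + joint_vc[with_p])) < 1e-9)
```

For balanced data the relationships are algebraic identities in the mean squares, so they hold for the estimates themselves (including negative estimates), not only in expectation.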

Differences in Sample Statistics for Raters 

We can also examine differences between the two procedures using the
sample statistics reported in Tables 1 and 4. In examining these differences,
we will occasionally point out (without proof) relationships between the results
in Tables 1 and 4 and the generalizability analyses results in Tables 3, 5, and
8.
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Correlations and Covariances Among Raters ^ Within Procedures > Using 
Tcdjles 1 and 4, the reader can verify that the average of the rater inter- 
correlations for the An^off procedure is 0»187^ and the corresporiOiixg result 
for the Nedelsky procediare is 0,222. In terms of covariances^ these averages 
are 0,0061 and 0.0125 for the Angoff and Nedelsky procedures, respectively 
The magnitude of these average covariances is influenced by the degree to 
which similar probabilities are assigned to items. Indeed, the average of 

the rater covariances is simply {i) for the Angoff procedure, and (i) 

1 2 

for the Nedelsky procedure. Evidently, there is more variability over items 
in the probabilities assigned using the Nedelsky procedxxre. We will see 
further evidence of this fact, below. 
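The equality between the average rater covariance and the estimated item variance component can be checked numerically. The sketch below is only an illustration: the small rater-by-item matrix is hypothetical (it is not the study's data), and it assumes the usual random effects estimator, $\hat{\sigma}^2(i) = [MS(i) - MS(ri)]/n_r$, with covariances computed over items using $n_i - 1$ in the denominator.

```python
from itertools import combinations

# Hypothetical rater-by-item matrix of probabilities (3 raters, 4 items).
# These numbers are illustrative only; they are not taken from the study.
X = [
    [0.60, 0.70, 0.80, 0.50],
    [0.55, 0.90, 0.75, 0.65],
    [0.70, 0.60, 0.85, 0.40],
]
n_r, n_i = len(X), len(X[0])


def mean(values):
    return sum(values) / len(values)


grand = mean([x for row in X for x in row])
rater_means = [mean(row) for row in X]
item_means = [mean([X[r][i] for r in range(n_r)]) for i in range(n_i)]

# Mean squares for the r x i design with one observation per cell.
ms_i = n_r * sum((m - grand) ** 2 for m in item_means) / (n_i - 1)
ss_ri = sum(
    (X[r][i] - rater_means[r] - item_means[i] + grand) ** 2
    for r in range(n_r)
    for i in range(n_i)
)
ms_ri = ss_ri / ((n_r - 1) * (n_i - 1))

# Assumed estimator of the item variance component.
sigma2_i = (ms_i - ms_ri) / n_r


def cov(a, b):
    """Sample covariance of two raters' ratings, taken over items."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


# Average covariance over all distinct pairs of raters.
avg_cov = mean([cov(X[a], X[b]) for a, b in combinations(range(n_r), 2)])

print(sigma2_i, avg_cov)  # the two values agree up to rounding error
```

Under these assumptions the two quantities coincide for any complete rater-by-item matrix, which is why a larger average covariance goes with a larger item variance component.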

Rater Means. Figure 1 provides a scatterplot of the rater means (over
items) for the Angoff procedure (see Table 1) and the Nedelsky procedure (see
Table 4). The reader can verify that the correlation in Figure 1 is -0.052;
and it can be shown that the covariance (in terms of the random effects variance
components in Table 8) is

$$\hat{\sigma}^2(r) + \hat{\sigma}^2(ri)/n_i = (-0.0002) + 0.0071/126 = -0.0001.$$

Clearly, there is little, if any, linear relationship between the two procedures
in terms of the five rater means.² Note that this result is not influenced by
the difference in the grand means ($\bar{X}$'s) for the two procedures.
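The arithmetic behind this covariance can be reproduced directly. The snippet uses only the rounded values quoted above, so the result matches the reported figure to four decimal places only; the variable names are ours.

```python
# Rounded values quoted in the text for the p x r x i design (Table 8).
sigma2_r = -0.0002   # estimated variance component for raters (negative estimate)
sigma2_ri = 0.0071   # estimated rater-by-item interaction component
n_i = 126            # number of items

# Covariance between the Angoff and Nedelsky rater means.
cov_rater_means = sigma2_r + sigma2_ri / n_i

print(round(cov_rater_means, 4))  # -0.0001
```

The interaction term contributes very little once it is divided by 126 items, so the sign of the covariance is determined by the (negative) estimate for raters.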

It appears from Figure 1, however, that there are two clusters of raters:
Raters 2 and 3, and Raters 1, 4, and 5. Given the small number of raters involved, we
hesitate to say that there is a strong correlation among raters within clusters;
however, Figure 1 certainly does not preclude this possibility. In any case,
Raters 2 and 3 are outstanding in that they assign relatively low probabilities
using the Nedelsky procedure and relatively high probabilities using the Angoff
procedure.



Insert Figure 1 about here 



Rater Standard Deviations. Figure 2 provides a scatterplot of the statistics
$\hat{\sigma}_i(X_{ri})$ for each rater, by both procedures. Recall that, for a given rater
and procedure, $\hat{\sigma}_i(X_{ri})$ is the standard deviation of the probabilities assigned
to items. We observe that the standard deviations for the Nedelsky procedure
are somewhat higher than those for the Angoff procedure, which is consistent
with the fact that the variance components for items and interactions are
higher for the Nedelsky procedure. Again, however, Rater 3 and, to some
extent, Rater 2 appear to be different from the other three raters. Specifically,
for both procedures, Raters 2 and 3 exhibit less variability in the
probabilities they assign to items.



Insert Figure 2 about here



Differences in Sample Statistics for Items

In principle, we could construct tables, analogous to Tables 1 and 4,
that would report, for both procedures, sample statistics for every item.
Since we have 126 items, however, the resulting tables would be too large
and too detailed to be very informative. Rather, we provide four perspectives
on item statistics in two figures and two tables.

Figure 3 provides a frequency polygon for the average (over raters) of
the probabilities assigned to items by both procedures; and Figure 4 provides
a frequency polygon of the standard deviation of the probabilities assigned
to items. Consistent with previously discussed results, Figure 3 indicates
that the modal probability (interval) is considerably higher for the Angoff
procedure. Also, consistent with previous results, Figure 4 indicates that
there is somewhat more variability in the probabilities assigned to items
using the Nedelsky procedure. Most importantly, however, the Nedelsky
standard deviations in Figure 4 are bimodal. As discussed below, this
bimodality is not an artifact of these data; it is a result that is virtually
guaranteed by the Nedelsky procedure, per se.



Insert Figures 3 and 4 about here



Recall that, for each rater, the probability assigned to an item by the
Nedelsky procedure is the inverse of the number of non-eliminated alternatives.
For the four-alternative items that characterize this study, this procedure
for assigning probabilities means that the only (inferred) probabilities that
can be assigned to an item using the Nedelsky procedure are 0.25, 0.33, 0.50,
and 1.00. In particular, note that there can be no probability between 0.50
and 1.00. Now, consider the probabilities assigned by raters to an item. If
all raters assign probabilities in the range 0.25 to 0.50, the standard
deviation will be relatively small; and, of course, if they all assign probabilities
of 1.00, the standard deviation will be zero. However, the standard deviation
will be relatively large when some raters assign a probability of 1.00, and
other raters assign probabilities of 0.50 or lower.
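The gap between "small" and "large" standard deviations can be made concrete by enumerating every way five raters could assign the four Nedelsky probabilities to a single item. The sketch below is illustrative only. It assumes the population form of the standard deviation (dividing by the number of raters); a sample standard deviation would rescale every value by the same factor and leave the gap intact.

```python
from itertools import combinations_with_replacement
from statistics import pstdev

# Inferred probabilities available for a four-alternative item.
NEDELSKY_PROBS = [1 / 4, 1 / 3, 1 / 2, 1.0]
N_RATERS = 5

low, high = [], []
for ratings in combinations_with_replacement(NEDELSKY_PROBS, N_RATERS):
    sd = pstdev(ratings)  # population standard deviation over raters
    if max(ratings) <= 0.5:
        low.append(sd)    # every rater is in the range 0.25 to 0.50
    elif min(ratings) < 1.0:
        high.append(sd)   # some raters at 1.00, others at 0.50 or lower
    # The remaining case (all raters at 1.00) has a standard deviation of zero.

largest_low = max(low)
smallest_high = min(high)
print(round(largest_low, 3), round(smallest_high, 3))
```

With five raters, no item can have a standard deviation between roughly 0.12 and 0.20 under these assumptions, so a frequency polygon of item standard deviations is pushed toward two separate modes.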

The bimodality in Figure 4, then, seems almost certainly a direct result
of having only a small number of unequally spaced probabilities with the
Nedelsky procedure. Furthermore, this peculiar characteristic of the
probability scale is a plausible explanation for the fact that our estimates of
the variability of $\bar{X}$ are higher for the Nedelsky procedure than for the Angoff
procedure. (See Tables 3 and 5.) Also, the restricted nature of the Nedelsky
probability scale may account for the differences in the means for the two
procedures, at least to some extent. To examine these issues in more detail,
let us consider Tables 9 and 10.



Insert Tables 9 and 10 about here



Tables 9 and 10 provide relative frequency distributions (over items)
for the probabilities assigned using the Angoff and Nedelsky procedures,
respectively. Inspection of these tables reveals several points of interest.
First, no rater assigned probabilities below 0.20 using the Angoff procedure.
This implies that the range of probabilities for the two procedures is about
the same; and, consequently, differential restriction in range is not a
factor of importance in this study. Second, for the Nedelsky procedure, on
the average, probabilities below 0.50 were used for 28 percent of the items,
whereas for the Angoff procedure, they were used for only 7 percent of the
items. Third, for the Angoff procedure, on the average, probabilities in
the range 0.60 to 0.95 were used with 53 percent of the items, whereas the
Nedelsky procedure precluded use of such probabilities.

These points and visual inspection of Tables 9 and 10 reveal a consistent
tendency for raters to assign more homogeneous probabilities using the Angoff
procedure. Furthermore, it appears that a rater who uses a probability of
0.33 or 0.50 with the Nedelsky procedure is very likely to use a somewhat
higher probability when given the opportunity to do so with the Angoff
procedure.

Operationalizing Conceptions of Minimum Competence

There are many ways in which the Nedelsky and Angoff procedures appear
to be very similar. For example, they both involve raters' judgments about
individual items; they both yield, directly or indirectly, a matrix of
rater-by-item probabilities; and, given this matrix, the computational process for
arriving at a cutting score is the same for both procedures. The procedures
obviously differ in that probabilities are directly elicited in the Angoff
procedure, whereas probabilities are inferred from eliminated distractors in
the Nedelsky procedure.

It is also possible that the two procedures differ, to some extent, in
the way they technically allow a rater to operationalize a conception of
minimum competence. In the Angoff procedure, to arrive at a probability, a rater
might conceptualize a group of minimally competent persons and reflect upon
what proportion would get the item correct. Alternatively, for the Angoff
procedure, a rater might conceptualize a single minimally competent person
and reflect upon what proportion of the times this person would correctly
respond to the item, if it were administered a large number of times.

For the Nedelsky procedure, however, there are only as many distinct
probabilities that can be assigned (indirectly) as there are alternatives to
the item, and these probabilities are not equally spaced. Logic suggests,
therefore, that neither of the above two conceptualizations works very well
with the Nedelsky procedure. For example, if a rater believes that 75 percent of
a group of minimally competent persons would get an item correct, the rater
cannot eliminate some number of alternatives that will yield a probability of
0.75. Technically, the rater cannot even report the average number of
alternatives that a group of minimally competent persons would eliminate, unless
this number is an integer.
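The 75 percent example can be traced through a few lines of code. This is a minimal sketch, assuming a four-alternative item and a rater who may leave anywhere from one to four alternatives uneliminated; the names are ours.

```python
from fractions import Fraction

N_ALTERNATIVES = 4

# Inferred probability is 1 / (number of non-eliminated alternatives).
attainable = [Fraction(1, k) for k in range(N_ALTERNATIVES, 0, -1)]  # 1/4, 1/3, 1/2, 1

target = Fraction(3, 4)  # the rater's belief: 75 percent would answer correctly

is_attainable = target in attainable
nearest_below = max(p for p in attainable if p <= target)
nearest_above = min(p for p in attainable if p >= target)

print(is_attainable, nearest_below, nearest_above)  # False 1/2 1
```

A rater holding this belief must settle for either 0.50 or 1.00, and either choice misstates the belief by a quarter of the probability scale.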

It seems, then, that the Nedelsky procedure constrains a rater to
conceptualize minimum competency in terms of the performance of a single person
on a single administration of an item, with the additional constraint that
this person will respond based upon a process of eliminating distractors. We
know of no compelling empirical evidence to suggest that examinees
(specifically, minimally competent examinees) generally respond to an item based upon
a process of eliminating distractors, even though this process is frequently
recommended to potential examinees. However, even if examinees do respond in
this manner, there still seem to be relatively clear differences in the
conceptualization of minimal competence implicit in the Angoff and Nedelsky
procedures.

This study cannot directly address the extent to which different
conceptualizations of minimum competency may have influenced the study's results;
and it is not likely that raters gave this matter a great deal of conscious
consideration. Nevertheless, any cutting score procedure necessitates some
conceptualization of minimum competence; it seems likely that the
conceptualizations are different for the Angoff and Nedelsky procedures; and evaluators
are probably well-advised to consider such differences in choosing a cutting
score procedure in a given context.

Cutting Scores Other Than $\bar{X}$

It is important to note that, throughout this paper, we have assumed that
the cutting score, $\bar{X}$, that results from either procedure is the mean of $\bar{X}_r$,
for all raters who participated in the study. For example, we pointed out that
Raters 2 and 3, in our study, appear to be different from the other three raters.
However, we did not suggest eliminating them from the study for the purposes of
calculating a cutting score. In our opinion, unless there is clear evidence that
a rater did not adhere to the intended procedure, it is probably not generally
advisable to eliminate atypical raters in determining the cutting score. (We
assume, of course, that raters were chosen carefully in the first place.)
However, if an atypical rater were eliminated, it would be best to redo analyses
using the remaining raters, only. We make this suggestion because the
elimination of an atypical rater, after the study is completed, probably implies that
one has changed one's conceptualization of the intended population of raters.

Reconciliation Process. Sometimes, rather than using $\bar{X}$ as the cutting
score, it is suggested that a cutting score be determined by a reconciliation
process. In principle, such a process might be applied in conjunction with
either the Nedelsky or the Angoff procedure. For example, after the five raters
in this study completed the Angoff procedure, they were instructed, as a group,
to reconcile their differences on each item. In Table 1 the mean (over items)
of these reconciled ratings is denoted $r(c)$. One typical result of using a
reconciliation process is that certain raters tend to dominate, or unequally
influence, the reconciled ratings. This is indeed what happened in our study,
as indicated by the high correlations between the actual and reconciled ratings
for Raters 1 and 2. The effect of this dominance by Raters 1 and 2 is that the
reconciled cutting score (0.70) is quite a bit different from $\bar{X}$ (0.66).

There is a certain logic in using a reconciliation process that appears to
be compelling. One might argue that the ideal result of using either the
Nedelsky or the Angoff procedure is for raters to agree on every item.
Therefore, why not force them to concur? One argument against this logic is that
forced consensus is not really agreement, although forced consensus may
effectively hide disagreement. Also, we point out that a reconciliation process does
not guarantee that the same cutting score will result each time a study is
replicated. If a study is replicated a large number of times, with different
raters, the average reconciled cutting score might be considerably different
from the average $\bar{X}$ or $\lambda$, in Equation 1; however, there could be as much, or
even more, variability in the distribution of reconciled cutting scores as
there is in the distribution of $\bar{X}$'s. We do not mean to imply, however, that
a reconciliation procedure should be avoided, necessarily; rather, we wish to
emphasize that use of a reconciliation procedure involves complexities over and
above those encompassed by either the Nedelsky or the Angoff procedure.

Nedelsky's Cutting Score. When Nedelsky originally described his
procedure he did not actually suggest using $\bar{X}$ (or $n_i \bar{X}$) as a cutting score.
Rather, the cutting score he suggested using is

$$M_{FD} + k\,\sigma_{FD}.$$

We find Nedelsky's discussion of $M_{FD}$, $k$, and $\sigma_{FD}$ somewhat confusing. However, it
appears that $M_{FD}$ is intended to be the mean test score for a group of "border-line"
examinees, only (Nedelsky, 1954, p. 5); $\sigma_{FD}$ is the standard deviation of this
distribution; and $k$ is an a priori defined constant used to classify these
"border-line" examinees into passing and failing examinees. Since Nedelsky
suggests using our $n_i \bar{X}$ as an estimate of $M_{FD}$, it is clear that his cutting
score will equal $n_i \bar{X}$ only if $k$ is defined as zero or $\sigma_{FD}$ is zero.

It is not clear to these authors why one would use $M_{FD} + k\,\sigma_{FD}$ as a cutting
score if one actually had test scores for a known group of "border-line"
examinees. In such a case, the test data themselves would likely provide a reasonably
sound basis for defining a cutting score independent of raters' judgments. We
infer, therefore, that Nedelsky probably wants us to consider a hypothetical
group of borderline examinees. We have already argued that there may be a
logical inconsistency in conceptualizing a group of minimally competent
examinees when one uses the Nedelsky procedure. However, even if we overlook this
issue, we are still faced with the problem of estimating $\sigma_{FD}$ (a parameter for
a test score distribution) using only the raters' probabilities.

It can be shown that the formula suggested by Nedelsky (1954, p. 12) for
estimating $\sigma^2_{FD}$ is

$$\hat{\sigma}^2_{FD} = \sum_r \sum_i X_{ri}(1 - X_{ri})/n_r \approx n_i[\bar{X}(1 - \bar{X}) - \hat{\sigma}^2(r) - \hat{\sigma}^2(i) - \hat{\sigma}^2(ri)];$$

where $X_{ri}$ is the (inferred) probability assigned to item $i$ by rater $r$. Nedelsky
provides a rationale for his estimate of $\sigma^2_{FD}$; but, in our opinion, his rationale
is weak in that it confounds considerations of parameters and estimates. However,
even if one accepts his formula for estimating $\sigma^2_{FD}$, the very process of
defining a cutting score as $M_{FD} + k\,\sigma_{FD}$ requires fairly strong assumptions and a
substantial degree of subjective judgment over and above that required to
estimate the cutting score $n_i \bar{X}$. Whether or not such complexity is advisable
depends upon the specific context of the cutting score decision process; however,
there are probably not many contexts in which this complexity is warranted and
the procedure is easily defended.
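The link between the double sum and the bracketed expression can be checked on a small example. The sketch below uses a hypothetical matrix of inferred probabilities (not the study's data) and replaces the three estimated variance components by the total variance of the matrix entries, computed with divisor $n_r n_i$. With that substitution, which is our assumption, the relationship holds exactly; the three estimated components sum to this total only approximately.

```python
# Hypothetical inferred probabilities: rows are raters, columns are items.
X = [
    [0.25, 0.50, 1.00, 0.50],
    [1 / 3, 0.50, 1.00, 1.00],
    [0.25, 1 / 3, 0.50, 1.00],
]
n_r, n_i = len(X), len(X[0])
cells = [x for row in X for x in row]

# Double-sum form: sum over raters and items of X(1 - X), divided by n_r.
var_fd = sum(x * (1 - x) for x in cells) / n_r

# Bracketed form, with the total variance of the entries standing in for
# the sum of the rater, item, and interaction variance components.
x_bar = sum(cells) / len(cells)
total_var = sum((x - x_bar) ** 2 for x in cells) / len(cells)
var_fd_alt = n_i * (x_bar * (1 - x_bar) - total_var)

print(var_fd, var_fd_alt)  # equal up to rounding error
```

The check shows that any disagreement among the entries of the matrix, whatever its source, lowers the estimate relative to $n_i \bar{X}(1 - \bar{X})$.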

Measurement Reliability or Dependability

In the Nedelsky and Angoff procedures the cutting score that results may
be viewed as the mean, over items and raters, of the probabilities assigned to
items. In this paper, we have denoted this mean as $\bar{X}$, and suggested that the
magnitudes and sources of error in a cutting score procedure may be examined
through studying the expected variance of $\bar{X}$. For both the Nedelsky and Angoff
procedures, there are three possible estimates of this variance, depending upon
whether a decision-maker wants to generalize to a population of raters and a
universe of items, to a population of raters for a fixed set of items, or to a
universe of items for a fixed set of raters.

More specifically, in generalizability theory the (observed mean) cutting
score, $\bar{X}$, resulting from a particular study is an unbiased estimate of $\lambda$ in
Equation 1. (Recall that $\lambda$ is the cutting score that would result if the
population of raters used the Nedelsky or Angoff procedure with the universe
of items.) However, we know that if a study were replicated a large number of
times, it is very likely that the $\bar{X}$'s from these studies would vary; and any
such variation reflects error in using $\bar{X}$ from a single study as an estimate
of $\lambda$. It is usually not possible to conduct a number of replications of a
cutting score procedure; but, even so, generalizability theory enables us to
estimate the expected variance in the distribution of $\bar{X}$ (see Table 2). It is
the expected variance of the distribution of means that we have examined in
considerable detail. Again, however, there is not just one estimate of this
variance. There are many estimates depending upon (a) whether one wishes to
generalize over raters, items, or both; and (b) the sizes of the samples of
raters and items used to calculate $\bar{X}$.

We emphasize that the numerical results reported in this paper are for a
specific study, only; and, as such, these results are illustrative, rather than
definitive. Nevertheless, there appear to be noticeable differences in the
means (or cutting scores) for the two procedures. Also, for each procedure,
there is evidence of error, as reflected in the expected variances of the
distributions of means over replications; and these variances frequently have
considerably different magnitudes for the two procedures. Given these results,
it seems reasonable to consider their potential impact on issues of reliability,
or measurement dependability. A complete discussion of these issues is beyond
the intended scope of this paper. We will, however, consider these issues in
the context of the index of dependability defined by Equation 6 in Table 11.

This index was developed by Brennan and Kane (1977a) using generalizability
theory and the linear model for the $p \times i$ design:

$$Y_{pi} = \mu + \mu_p^{\sim} + \mu_i^{\sim} + \mu_{pi}^{\sim}. \qquad (7)$$

(See also Brennan, 1978, and Brennan and Kane, 1977b.) In Equation 7, $Y_{pi}$ is
the observed score of person $p$ on item $i$, and the terms to the right of the
equality are the score effects or components for the decomposition of $Y_{pi}$.
The linear models in Equations 1 and 7 are formally identical. We have used
different notation in each of them for the purpose of emphasizing that Equation
1 is applied to a rater-by-item matrix of probabilities, whereas Equation 7 is
applied to the person-by-item matrix of observed scores.

For any meaningful joint use of Equations 1 and 7, the item universe must
be the same for both equations, although the effect for items in Equation 1
is different from the effect for items in Equation 7. Most importantly,
$\lambda$ and $\mu$ in Equations 1 and 7 are very different. The parameter
$\lambda$ is the cutting score (or grand mean of the probabilities) for the population
of raters and universe of items; whereas the parameter $\mu$ is the grand mean of
the observed scores, $Y_{pi}$, for the population of persons and the universe of
items.

Using generalizability theory and the linear model in Equation 7, Brennan
and Kane (1977a) derived Equation 8 in Table 11 as an estimate of their index
of dependability. This estimate is identified as $\hat{\Phi}(\lambda)$ in Equation 8 to
emphasize that it is based on the assumption that $\lambda$ is somehow known, without error.
This assumption is reflected in the term $(\bar{Y} - \lambda)^2$ in the numerator and
denominator of Equation 8. When $\lambda$ is not known, however, and we use $\bar{X}$ from a
particular study as an estimate of $\lambda$, then this term is no longer appropriate.



Insert Table 11 about here 



Furthermore, we may not simply replace $\lambda$ with $\bar{X}$ in the term $(\bar{Y} - \lambda)^2$
because the expected value of a squared quantity is not equal to the square
of the expected value. Rather, the expected value of $(\bar{Y} - \bar{X})^2$ is

$$E_{R,I}(\bar{Y} - \bar{X})^2 = (\bar{Y} - \lambda)^2 + \sigma^2(\bar{X}), \qquad (9)$$

if we wish to generalize over samples of raters (R) and samples of items (I).
If we wish to generalize over samples of items, only, then

$$E_{I}(\bar{Y} - \bar{X})^2 = (\bar{Y} - \lambda)^2 + \sigma^2(\bar{X} \mid R^*). \qquad (10)$$

Table 2 provides equations for $\sigma^2(\bar{X})$ and $\sigma^2(\bar{X} \mid R^*)$ in terms of variance
components for the effects in Equation 1; and Equations 9 and 10 can be derived
using an approach employed by Brennan and Kane (1977a, p. 280).
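The reason for the extra variance term in Equation 9 can be seen in a toy calculation. In this sketch every number is hypothetical: $\bar{X}$ is allowed to take only two equally likely values centered on $\lambda$, and $\bar{Y}$ is treated as a fixed quantity.

```python
# Hypothetical values, not taken from the study.
lam = 0.66    # lambda: cutting score for the population of raters and universe of items
y_bar = 0.72  # mean observed score for examinees, treated as fixed
d = 0.04      # X-bar equals lam - d or lam + d, each with probability 1/2

x_bar_values = [lam - d, lam + d]

# Variance of X-bar over replications (equals d ** 2 for this two-point case).
var_x_bar = sum((x - lam) ** 2 for x in x_bar_values) / 2

# Expected value of the squared difference when X-bar replaces lambda.
expected_sq = sum((y_bar - x) ** 2 for x in x_bar_values) / 2

# The term that appears when lambda is assumed known.
known_lambda_term = (y_bar - lam) ** 2

print(expected_sq, known_lambda_term + var_x_bar)  # equal up to rounding error
```

On average, then, the squared difference computed from an observed cutting score overstates $(\bar{Y} - \lambda)^2$ by exactly the variance of $\bar{X}$, which is the quantity later subtracted in the modified estimates.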

It follows from Equations 9 and 10 that when we use $\bar{X}$ as an estimate of
$\lambda$ we should also subtract $\hat{\sigma}^2(\bar{X})$ or $\hat{\sigma}^2(\bar{X} \mid R^*)$, as appropriate, from both the
numerator and the denominator of Equation 8 in Table 11. The two resulting
(modified) estimates of the index of dependability $\Phi(\lambda)$ are provided by
Equations 11 and 12 in Table 11.

Let us return now to the original question that motivated our development
of Equations 11 and 12; namely, what effect do different values for $\bar{X}$ and its
expected variability, for the Nedelsky and Angoff procedures, have on reliability
or measurement dependability? Without loss of generality, we restrict ourselves
to considering the estimate in Equation 11 for generalizing over raters and items. Since
$\hat{\Phi}(\lambda)$ can be no greater than one, decreasing the numerator and denominator in
Equation 8 by $\hat{\sigma}^2(\bar{X})$ results in decreasing the magnitude of the estimate of the
Brennan-Kane index. This is to be expected, because we have introduced additional
sources of error attributable to the procedure used to establish a cutting score.

Furthermore, "all other things" being equal, the larger the magnitude of
$\hat{\sigma}^2(\bar{X})$, the smaller the magnitude of $\hat{\Phi}(\lambda)$. Since our study results suggest that
$\hat{\sigma}^2(\bar{X})$ is larger for the Nedelsky procedure, we might expect $\hat{\Phi}(\lambda)$ to be smaller
for the Nedelsky procedure. However, "all other things" are not equal unless
the cutting scores for the procedures are equal. When they are unequal, as
we found, the magnitude of $(\bar{Y} - \bar{X})^2$ will be different; and this difference,
in turn, will affect the magnitude of $\hat{\Phi}(\lambda)$. Moreover, whether or not higher
values of $\bar{X}$ will result in higher values of $(\bar{Y} - \bar{X})^2$ depends entirely upon the
magnitude of $\bar{Y}$. In brief, the effect of $\bar{X}$ and $\hat{\sigma}^2(\bar{X})$ on the magnitude of $\hat{\Phi}(\lambda)$
cannot be predicted independent of test data for examinees; and it is not
necessarily the case that lower values of $\hat{\sigma}^2(\bar{X})$ are always associated with
higher values for estimates of measurement dependability.
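A small numerical sketch may help to show why the outcome depends on examinee data. Equation 8 appears in Table 11, which is not reproduced in this section, so the function below rests on an assumption: the index is taken to be a ratio whose numerator is a "signal" term plus $(\bar{Y} - \bar{X})^2$ and whose denominator adds an "error" term, with $\hat{\sigma}^2(\bar{X})$ subtracted from both. The function name and every numerical value are hypothetical.

```python
def modified_index(signal, error, y_bar, x_bar, var_x_bar):
    """Assumed form of a modified dependability estimate (see lead-in).

    The variance of the observed cutting score is subtracted from the
    squared-difference term in both the numerator and the denominator.
    """
    adjusted_sq = (y_bar - x_bar) ** 2 - var_x_bar
    return (signal + adjusted_sq) / (signal + adjusted_sq + error)


# Hypothetical inputs: a higher cutting score with a small variance
# ("procedure A") and a lower cutting score with a larger variance
# ("procedure B").
SIGNAL, ERROR = 0.01, 0.005
proc_a = {"x_bar": 0.66, "var_x_bar": 0.0004}
proc_b = {"x_bar": 0.55, "var_x_bar": 0.0012}

for y_bar in (0.60, 0.80):
    idx_a = modified_index(SIGNAL, ERROR, y_bar, **proc_a)
    idx_b = modified_index(SIGNAL, ERROR, y_bar, **proc_b)
    print(y_bar, round(idx_a, 3), round(idx_b, 3))
```

With these inputs the procedure with the larger variance yields the lower index when the examinee mean is 0.60, but the higher index when the examinee mean is 0.80, because its cutting score then lies much farther from $\bar{Y}$.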

Note that we have not suggested that $\hat{\sigma}^2(\bar{X} \mid I^*)$ be considered in the
context of modifying the Brennan-Kane index. Of course, there is an equation
analogous to Equations 9 and 10; namely,

$$E_{R}(\bar{Y} - \bar{X})^2 = (\bar{Y} - \lambda)^2 + \sigma^2(\bar{X} \mid I^*),$$

in which generalization is over samples of raters only. However, in this
equation, items are considered fixed; and if we incorporate $\hat{\sigma}^2(\bar{X} \mid I^*)$ into an
estimate of $\Phi(\lambda)$ we must then consider items fixed in estimating the other
variance components, too. To do so means that there is no larger universe
of items (or tests) to which we wish to generalize; and, under such
circumstances, estimates of reliability, generalizability, or dependability for the
$p \times i$ design are usually undefined.

Summary and Conclusions

Based upon an application of generalizability theory to a rater-by-item
matrix of probabilities, we have provided and discussed equations for
estimating the expected variability in a cutting score determined by the Nedelsky
or Angoff procedure. Our development assumes that the cutting score in a
particular study is the observed mean (probability) over raters and items,
and that this mean may be viewed as an estimate of an "idealized" cutting
score, defined as the mean for a population of raters and a universe of items.
In this sense, the expected variability of the observed mean is error variance
attributable to a particular application of the procedure used to define a
cutting score.

We have applied this approach to data sets resulting from the application
of the Nedelsky and Angoff procedures by five raters to a 126-item test. Also,
we have examined these results for each procedure separately, and we have
compared results over procedures. Our data indicate that both the cutting scores
and their expected variances are considerably different for the two procedures.
We have postulated that these differences may be explained, in whole or in part,
by differences in the ways probabilities are assigned using the two procedures,
or differences in the ways minimum competency is conceptualized. Both
explanations depend heavily on the fact that the Nedelsky procedure necessarily
(although indirectly) restricts a rater to a small discrete number of unequally
spaced probabilities.

In examining the two procedures, we have considered several issues not
directly associated with variability in the mean cutting score, $\bar{X}$. Our data
suggest, for example, that even when raters agree on the number of alternatives
to eliminate using the Nedelsky procedure, these same raters may disagree on
which alternatives to eliminate. Also, we found that raters, using the Nedelsky
procedure, eliminated a considerable number of correct alternatives.
Furthermore, we have briefly discussed some issues associated with use of a
reconciliation procedure and the elimination of atypical raters.

Finally, we have examined the influence of different values of $\bar{X}$, and the
expected variance in the distribution of $\bar{X}$, on reliability, or measurement
dependability. To do so, we developed a modification of the Brennan-Kane
index of dependability, $\Phi(\lambda)$. We found that, for a given value of $\bar{X}$,
increasing the expected variance of $\bar{X}$ results in decreasing the estimate of $\Phi(\lambda)$.
However, if both $\bar{X}$ and its variance change, then the estimate of $\Phi(\lambda)$ could
increase, decrease, or even remain unchanged, depending on results for examinee
test data.

The numerical results reported in this paper are for a single study, only.
As such, they surely do not form a sufficient basis for passing judgment on
either the Nedelsky or the Angoff procedure. Even so, these data do suggest
that the differences between these procedures may be of greater consequence
than their apparent similarities. In particular, the restricted nature of the
Nedelsky (inferred) probability scale may constitute a basis for rejecting this
procedure in certain contexts.

Appendix

The $r \times i$ Design and Sampling from a Finite Universe

Table 2 provides equations for obtaining estimated random effects variance
components for the $r \times i$ design. We emphasize that these variance components
are for a random effects model in which both the size of the population of
raters, $N_r$, and the size of the universe of items, $N_i$, are assumed to approach
infinity (i.e., $N_r \to \infty$ and $N_i \to \infty$). The equations in Table 2 for estimating
the variability of mean scores do distinguish, however, between G study sample
sizes and D study sample sizes. (D study sample sizes are identified with
primes.)

The G study (i.e., generalizability study) sample sizes are the actual
numbers of raters and items on which G study data are available; and these are
the sample sizes used to calculate $\hat{\sigma}^2(r)$, $\hat{\sigma}^2(i)$, and $\hat{\sigma}^2(ri)$ in terms of mean
squares. A decision-maker, however, may be interested in a D study involving
consideration of the expected value of statistics, such as $\hat{\sigma}^2(\bar{X})$, for sample
sizes that are different from the G study sample sizes. Of course, the G
study and D study may be the same study, in which case there is no distinction
between G study and D study sample sizes.

We wish to develop a general expression for the expected variance of the
mean, $\bar{X}$, for samples of $n_r'$ raters from a population of any size, $N_r$, and samples
of $n_i'$ items from a universe of any size, $N_i$. We begin by expressing expected
mean squares in terms of variance components using the Cornfield and Tukey (1956)
procedures (treated by Millman and Glass, 1967; Kirk, 1968; and others):

$$E[MS(r)] = (1 - n_i/N_i)\,\sigma^2(ri) + n_i\,\sigma^2(r \mid N_i), \qquad (A.1)$$

$$E[MS(i)] = (1 - n_r/N_r)\,\sigma^2(ri) + n_r\,\sigma^2(i \mid N_r); \text{ and} \qquad (A.2)$$

$$E[MS(ri)] = \sigma^2(ri). \qquad (A.3)$$

In Equations A.1 to A.3 it is important to note that $\sigma^2(r \mid N_i)$ is not identical
to the random effects variance component $\sigma^2(r)$ unless $N_i \to \infty$; similarly,
$\sigma^2(i \mid N_r)$ is not identical to $\sigma^2(i)$ unless $N_r \to \infty$; and $\sigma^2(ri)$ is unaffected by
the size of $N_r$ and/or $N_i$. Also, note that the sample sizes in Equations A.1
to A.3 are for the G study, not the D study.

Now, in terms of estimates of the variance components in Equations A.1 to
A.3,

$$\hat{\sigma}^2(\bar{X}) = \left(1 - \frac{n_r'}{N_r}\right)\frac{\hat{\sigma}^2(r \mid N_i)}{n_r'} + \left(1 - \frac{n_i'}{N_i}\right)\frac{\hat{\sigma}^2(i \mid N_r)}{n_i'} + \left(1 - \frac{n_r'}{N_r}\right)\left(1 - \frac{n_i'}{N_i}\right)\frac{\hat{\sigma}^2(ri)}{n_r' n_i'}. \qquad (A.4)$$

Equation A.4 results from well-known principles concerning the variance of the
distribution of means for (D study) samples of size $n'$ randomly sampled from a
population or universe of size $N$ (see, for example, Cochran, 1977, p. 23). A
slight modification of Equation A.4 is sometimes discussed in treatments of
matrix sampling (see, for example, Sirotnik and Wellington, 1977, p. 354).

Equation A.4, however, is sometimes awkward to use for a particular D
study, because G study results are usually reported in terms of estimated
random effects variance components, not the estimated G study variance
components $\hat{\sigma}^2(r \mid N_i)$ and $\hat{\sigma}^2(i \mid N_r)$, for sampling from a finite universe. Brennan
(1977) has shown that

$$\hat{\sigma}^2(r \mid N_i) = \hat{\sigma}^2(r) + \hat{\sigma}^2(ri)/N_i, \qquad (A.5)$$

and

$$\hat{\sigma}^2(i \mid N_r) = \hat{\sigma}^2(i) + \hat{\sigma}^2(ri)/N_r; \qquad (A.6)$$

where estimated variance components to the right of the equalities are the
random effects variance components in Table 2.

Given equations A. 5 and A.6, Equation A. 4 can be expressed as: 




+ I 1 ^ (jc^^7) 

N N. / n'nT 
r 1 / r 1 



Equation A.7 can be used to obtain an unbiased estimate of the expected variance
of $\bar{X}$ for any values of $n'_r$, $n'_i$, $N_r$, and $N_i$, using random
effects variance components only. Let us consider, for example, the three
special cases in Table 2. If $N_r$ and $N_i$ both approach infinity, then
Equation A.7 is $\hat{\sigma}^2(\bar{X})$ in Table 2. If $N_r$ approaches
infinity, and $n'_i = N_i$, then Equation A.7 is $\hat{\sigma}^2(\bar{X}|I^*)$
in Table 2. If $N_i$ approaches infinity, and $n'_r = N_r$, then Equation A.7
is $\hat{\sigma}^2(\bar{X}|R^*)$ in Table 2.
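
The three special cases can be checked numerically. The following is a minimal
Python sketch, not part of the original analysis, that evaluates Equation A.7
as three finite-population-corrected terms. The function name and argument
names are illustrative assumptions; the variance components are the rounded
random effects estimates reported for the Angoff procedure in Table 3.

```python
import math


def expected_variance_of_mean(var_r, var_i, var_ri, n_r, n_i,
                              N_r=math.inf, N_i=math.inf):
    """Evaluate Equation A.7 (hypothetical helper, names are assumptions).

    var_r, var_i, var_ri: random effects variance component estimates.
    n_r, n_i: D study sample sizes for raters and items.
    N_r, N_i: universe sizes; math.inf means an infinite universe.
    """
    fpc_r = 1.0 - n_r / N_r                    # rater term correction
    fpc_i = 1.0 - n_i / N_i                    # item term correction
    fpc_ri = 1.0 - (n_r / N_r) * (n_i / N_i)   # interaction term correction
    return (fpc_r * var_r / n_r
            + fpc_i * var_i / n_i
            + fpc_ri * var_ri / (n_r * n_i))


# Rounded random effects estimates from Table 3 (Angoff procedure).
var_r, var_i, var_ri = 0.0012, 0.0061, 0.0267

both_random = expected_variance_of_mean(var_r, var_i, var_ri, 5, 126)
raters_fixed = expected_variance_of_mean(var_r, var_i, var_ri, 5, 126, N_r=5)
items_fixed = expected_variance_of_mean(var_r, var_i, var_ri, 5, 126, N_i=126)

print(f"{both_random:.5f} {raters_fixed:.5f} {items_fixed:.5f}")
```

With five raters and 126 items, the three calls agree to five decimal places
with the Table 3 values for the random case, the fixed-raters case, and the
fixed-items case, respectively.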
Footnotes

The junior author is currently Director, Southeastern Regional Test
Development Center, Educational Testing Service, Atlanta, Georgia.

^1 For the r × (d:i) design, it can be argued that a rater may not examine
a given distractor independent of the other distractors for an item. If so,
then one independence assumption associated with the linear model for this
design becomes suspect, at least to some extent. This issue, however, is
relatively unimportant here because our analysis is intended only to summarize
data that have an indirect bearing on the principal analyses in this paper.

^2 By definition, a variance component must be nonnegative; however, estimates
of variance components are occasionally negative. When a negative estimate
occurs, sometimes it is advisable to treat it as zero (see Cronbach et al.,
1972, and Brennan, 1977), and at other times it is best to leave the estimate
unchanged (see Sirotnik, 1970). Here, we do not set $\hat{\sigma}^2(r)$ to zero
because it is a mathematical fact that a covariance of the type in Figure 1 is
exactly

$$\hat{\sigma}^2(r) + \hat{\sigma}^2(ri)/n_i ,$$

as shown by Cronbach et al. (1972, Chapter 8). This is true even if
$\hat{\sigma}^2(r)$ is negative. Indeed, a negative covariance could never
occur unless $\hat{\sigma}^2(r)$ and/or $\hat{\sigma}^2(ri)$ were negative.
Table 1

Means, Standard Deviations, and Intercorrelations
Among Raters for the Angoff Procedure

                     Intercorrelations                  Mean over   S.D. over
Rater     r(2)    r(3)    r(4)    r(5)    r(c)^a        items       items

r(1)      .525    .053    .046    .150    .731          0.6713      0.2033
r(2)              .171    .206    .382    .744          0.7194      0.1607
r(3)                      .161   -.036    .237          0.6559      0.1193
r(4)                              .209    .217          0.6167      0.1869
r(5)                                      .432          0.6525      0.2183
r(c)^a                                                  0.6984      0.1541

$\bar{X}$ = 0.6632 (83.56)^b        $\hat{\sigma}(\bar{X}_r)$ = 0.0373 (4.70)^b

^a r(c) is a reconciled rating arrived at by the raters themselves.
^b Numbers within parentheses are expressed in terms of number of items.
Table 2

Variance Components Notation and Formulas for
Random Effects r × i Design

$\hat{\sigma}^2(r) = [MS(r) - MS(ri)]/n_i$
$\hat{\sigma}^2(i) = [MS(i) - MS(ri)]/n_r$
$\hat{\sigma}^2(ri) = MS(ri)$

Number of raters = $n_r$ (G study); $n'_r$ (D study)
Number of items = $n_i$ (G study); $n'_i$ (D study)

$\hat{\sigma}^2(\bar{X}_r) = \hat{\sigma}^2(r) + \hat{\sigma}^2(ri)/n'_i$
$\hat{\sigma}^2(\bar{X}) = \hat{\sigma}^2(r)/n'_r + \hat{\sigma}^2(i)/n'_i + \hat{\sigma}^2(ri)/n'_r n'_i$    (3)
$\hat{\sigma}^2(\bar{X}|R^*) = \hat{\sigma}^2(i)/n'_i + \hat{\sigma}^2(ri)/n'_r n'_i$    (4)
$\hat{\sigma}^2(\bar{X}|I^*) = \hat{\sigma}^2(r)/n'_r + \hat{\sigma}^2(ri)/n'_r n'_i$    (5)
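
As an illustration of how the Table 2 estimators are applied, the short Python
sketch below computes the three G study variance components from mean squares
and then the standard deviation of a rater mean over items. The function name
is an assumption introduced for this example; the mean squares are those
reported for the Angoff procedure in Table 3.

```python
import math


def g_study_components(ms_r, ms_i, ms_ri, n_r, n_i):
    """Random effects estimates for the r x i design (hypothetical helper)."""
    return {
        "r": (ms_r - ms_ri) / n_i,   # rater component
        "i": (ms_i - ms_ri) / n_r,   # item component
        "ri": ms_ri,                 # interaction, confounded with error
    }


n_r, n_i = 5, 126
components = g_study_components(0.1750, 0.0572, 0.0267, n_r, n_i)

# Variance and standard deviation of one rater's mean over n_i items.
var_rater_mean = components["r"] + components["ri"] / n_i
sd_rater_mean = math.sqrt(var_rater_mean)

# The same standard deviation expressed as a number of items.
sd_in_items = sd_rater_mean * n_i

print(components)
print(round(sd_rater_mean, 4), round(sd_in_items, 2))
```

Carrying the unrounded components through the calculation gives a standard
deviation of about 0.0373, or about 4.70 items, which matches the values
reported in Tables 1 and 3.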
Table 3

ANOVA, Variance Components, and the Variability of Mean
Scores for the Angoff Procedure

Effect ($\alpha$)     df        SS        MS      $\hat{\sigma}^2(\alpha)$

r                      4     0.7000    0.1750     0.0012
i                    125     7.1443    0.0572     0.0061
ri                   500    13.3526    0.0267     0.0267

$\hat{\sigma}^2(\bar{X}_r)$ = 0.0014         $\hat{\sigma}(\bar{X}_r)$ = 0.0373 (4.70)
$\hat{\sigma}^2(\bar{X})$ = 0.00033          $\hat{\sigma}(\bar{X})$ = 0.0182 (2.29)
$\hat{\sigma}^2(\bar{X}|R^*)$ = 0.00009      $\hat{\sigma}(\bar{X}|R^*)$ = 0.0095 (1.20)
$\hat{\sigma}^2(\bar{X}|I^*)$ = 0.00028      $\hat{\sigma}(\bar{X}|I^*)$ = 0.0167 (2.10)

Note. The terms $\hat{\sigma}^2(\alpha)$ are, more specifically,
$\hat{\sigma}^2(r)$, $\hat{\sigma}^2(i)$, and $\hat{\sigma}^2(ri)$. Results in
the second half of this table for the variability of mean scores assume that
$n'_r = n_r = 5$ and $n'_i = n_i = 126$.

Results within parentheses are expressed in terms of number of items.
Table 4

Means, Standard Deviations, and Intercorrelations Among Raters
for Probability of Correct Response from Nedelsky Procedure

                Intercorrelations             Mean over   S.D. over
Rater     r(2)    r(3)    r(4)    r(5)        items       items

r(1)      .307    .118    .196    .377        0.6438      0.2828
r(2)              .065    .204    .350        0.5337      0.2317
r(3)                      .161    .195        0.4495      0.1827
r(4)                              .242        0.5700      0.2376
r(5)                                          0.5844      0.2310

$\bar{X}$ = 0.5563 (70.09)^a        $\hat{\sigma}(\bar{X}_r)$ = 0.0717 (9.03)^a

^a Results within parentheses are expressed in terms of number of items.
Table 5

ANOVA, Variance Components, and Variability of Mean Scores
for Probability of Correct Response from Nedelsky Procedure

Effect ($\alpha$)     df        SS        MS      $\hat{\sigma}^2(\alpha)$

r                      4     2.5890    0.6473     0.0048
i                    125    13.1817    0.1055     0.0125
ri                   500    21.4284    0.0429     0.0429

$\hat{\sigma}^2(\bar{X}_r)$ = 0.0051       $\hat{\sigma}(\bar{X}_r)$ = 0.0717 (9.03)
$\hat{\sigma}^2(\bar{X})$ = 0.0011         $\hat{\sigma}(\bar{X})$ = 0.0336 (4.24)
$\hat{\sigma}^2(\bar{X}|R^*)$ = 0.0002     $\hat{\sigma}(\bar{X}|R^*)$ = 0.0130 (1.64)
$\hat{\sigma}^2(\bar{X}|I^*)$ = 0.0010     $\hat{\sigma}(\bar{X}|I^*)$ = 0.0321 (4.04)

Note. The terms $\hat{\sigma}^2(\alpha)$ are, more specifically,
$\hat{\sigma}^2(r)$, $\hat{\sigma}^2(i)$, and $\hat{\sigma}^2(ri)$.
Results in the second half of this table, for the variability of mean
scores, assume that $n'_r = n_r = 5$ and $n'_i = n_i = 126$.
Table 6

ANOVA and Variance Components for Eliminated Distractors
Using Nedelsky Procedure

Effect ($\alpha$)     df             SS            MS            $\hat{\sigma}^2(\alpha|D^*)$

r                 [illegible]    [illegible]   [illegible]    0.0058
i                     125          42.6245       0.3410       0.0121
d:i               [illegible]    [illegible]     0.4957       0.0629
ri                    500          79.9844       0.1600       0.0533
rd:i                 1008         182.8853       0.1814       0.1814

$\hat{\sigma}^2(r|D^*) = [MS(r) - MS(ri)]/n_i n_d$
$\hat{\sigma}^2(i|D^*) = [MS(i) - MS(ri)]/n_r n_d$
$\hat{\sigma}^2(d{:}i|D^*) = [MS(d{:}i) - MS(rd{:}i)]/n_r$
$\hat{\sigma}^2(ri|D^*) = MS(ri)/n_d$
$\hat{\sigma}^2(rd{:}i|D^*) = MS(rd{:}i)$

Mean over items of proportion of distractors eliminated for raters
1 to 5:

$\bar{X}_r$ = 0.6984, 0.6085, 0.5053, 0.6561, 0.6878

$\bar{X}$ = 0.6312        $\hat{\sigma}(\bar{X}_r)$ = 0.0786
Table 7

Equations for Estimating Variance Components and the Expected Variance
of the Mean Score for the p × r × i Design

Effect    Estimated Variance Component

p         $[MS(p) - MS(pr) - MS(pi) + MS(pri)]/n_r n_i$
r         $\{MS(r) - MS(ri) - (1 - n_p/N_p)[MS(pr) - MS(pri)]\}/n_p n_i$
i         $\{MS(i) - MS(ri) - (1 - n_p/N_p)[MS(pi) - MS(pri)]\}/n_p n_r$
pr        $[MS(pr) - MS(pri)]/n_i$
pi        $[MS(pi) - MS(pri)]/n_r$
ri        $[MS(ri) - (1 - n_p/N_p)MS(pri)]/n_p$
pri       $MS(pri)$

When $n_p = N_p$, we identify the variance components as
$\hat{\sigma}^2(\alpha|P^*)$. In terms of these variance components:

$\hat{\sigma}^2(\bar{X}_r|P^*) = \hat{\sigma}^2(r|P^*) + \hat{\sigma}^2(ri|P^*)/n'_i$
$\hat{\sigma}^2(\bar{X}|P^*) = \hat{\sigma}^2(r|P^*)/n'_r + \hat{\sigma}^2(i|P^*)/n'_i + \hat{\sigma}^2(ri|P^*)/n'_r n'_i$
$\hat{\sigma}^2(\bar{X}|P^*,R^*) = \hat{\sigma}^2(i|P^*)/n'_i + \hat{\sigma}^2(ri|P^*)/n'_r n'_i$
$\hat{\sigma}^2(\bar{X}|P^*,I^*) = \hat{\sigma}^2(r|P^*)/n'_r + \hat{\sigma}^2(ri|P^*)/n'_r n'_i$
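
The estimators for the p × r × i design can be sketched in Python as follows.
This is an illustrative reading of the Table 7 equations, assuming that the
finite-universe correction enters through the factor (1 - n_p/N_p); the
function name and dictionary keys are hypothetical. The mean squares are those
reported in Table 8, with two procedures, five raters, and 126 items.

```python
import math


def pri_components(ms, n_p, n_r, n_i, N_p=math.inf):
    """Variance component estimates for the p x r x i design.

    ms: dict of mean squares keyed by effect name.
    N_p: size of the universe of procedures. math.inf gives the random
         effects estimates; N_p == n_p treats procedures as fixed (P*).
    """
    c = 1.0 - n_p / N_p  # equals 1 for random procedures, 0 for fixed
    return {
        "p": (ms["p"] - ms["pr"] - ms["pi"] + ms["pri"]) / (n_r * n_i),
        "r": (ms["r"] - ms["ri"] - c * (ms["pr"] - ms["pri"])) / (n_p * n_i),
        "i": (ms["i"] - ms["ri"] - c * (ms["pi"] - ms["pri"])) / (n_p * n_r),
        "pr": (ms["pr"] - ms["pri"]) / n_i,
        "pi": (ms["pi"] - ms["pri"]) / n_r,
        "ri": (ms["ri"] - c * ms["pri"]) / n_p,
        "pri": ms["pri"],
    }


# Mean squares from Table 8.
ms = {"p": 3.5989, "r": 0.3884, "i": 0.1253, "pr": 0.4339,
      "pi": 0.0373, "ri": 0.0419, "pri": 0.0277}

random_effects = pri_components(ms, n_p=2, n_r=5, n_i=126)
procedures_fixed = pri_components(ms, n_p=2, n_r=5, n_i=126, N_p=2)

for effect in ms:
    print(effect, round(random_effects[effect], 4),
          round(procedures_fixed[effect], 4))
```

Under this reading, the random effects estimate for raters is slightly
negative, while the estimate with procedures fixed is positive, in agreement
with the two columns of variance components in Table 8.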
Table 8

ANOVA and Variance Components for Probability of
Correct Response with Both Procedures

Effect ($\alpha$)     df        SS        MS      $\hat{\sigma}^2(\alpha)$    $\hat{\sigma}^2(\alpha|P^*)$

p                      1     3.5989    3.5989      0.0050        0.0050
r                      4     1.5537    0.3884     -0.0002        0.0014
i                    125    15.6601    0.1253      0.0074        0.0083
pr                     4     1.7354    0.4339      0.0032        0.0032
pi                   125     4.6652    0.0373      0.0019        0.0019
ri                   500    20.9520    0.0419      0.0071        0.0210
pri                  500    13.8285    0.0277      0.0277        0.0277

Means over procedures and items for raters 1 to 5:

$\bar{X}_r$ = 0.6576, 0.6266, 0.5527, 0.5934, 0.6185

$\bar{X}$ = 0.6097 (76.82)      $\hat{\sigma}(\bar{X}_r) = \hat{\sigma}(\bar{X}_r|P^*)$ = 0.0396 (4.99)

$\hat{\sigma}^2(\bar{X}|P^*)$ = 0.00038         $\hat{\sigma}(\bar{X}|P^*)$ = 0.0195 (2.46)
$\hat{\sigma}^2(\bar{X}|P^*,R^*)$ = 0.00010     $\hat{\sigma}(\bar{X}|P^*,R^*)$ = 0.0100 (1.26)
$\hat{\sigma}^2(\bar{X}|P^*,I^*)$ = 0.00031     $\hat{\sigma}(\bar{X}|P^*,I^*)$ = 0.0176 (2.22)
Table 9

Relative Frequency Distribution by Raters by Probability
of Correct Response Using Angoff Procedure

Probability                       Relative Frequency
of Correct
Response^a        r(1)    r(2)    r(3)    r(4)    r(5)    Average

<0.20             0.00    0.00    0.00    0.00    0.00    0.00
(0.20, 0.25)      0.03    0.02    0.00    0.07    0.06    0.04
(0.30, 0.35)      0.02    0.00    0.00    0.02    0.06    0.02
(0.40, 0.45)      0.00    0.00    0.06    0.00    0.00    0.01
(0.50, 0.55)      0.42    0.24    0.19    0.44    0.36    0.33
(0.60, 0.65)      0.01    0.04    0.31    0.00    0.01    0.07
(0.70, 0.75)      0.26    0.33    0.23    0.37    0.31    0.30
(0.80, 0.85)      0.01    0.23    0.20    0.00    0.02    0.09
(0.90, 0.95)      0.12    0.07    0.01    0.10    0.04    0.07
>0.95             0.13    0.08    0.00    0.00    0.15    0.07

$\bar{X}_r$       0.67    0.72    0.66    0.62    0.65    0.66

^a Raters were constrained to report their probabilities in units of 0.05.
Table 10

Relative Frequency Distribution by Raters for Probability
of Correct Response Using Nedelsky Procedure

Probability                       Relative Frequency
of Correct
Response          r(1)    r(2)    r(3)    r(4)    r(5)    Average

0.25              0.06    0.06    0.07    0.05    0.00    0.05
0.33              0.16    0.25    0.42    0.16    0.17    0.23
0.50              0.40    0.52    0.43    0.57    0.60    0.50
1.00^a            0.38    0.18    0.08    0.22    0.23    0.22

$\bar{X}_r$       0.64    0.53    0.45    0.57    0.58    0.56

^a Our analyses of these data used a probability of 0.99, rather than
1.00, for coding convenience.
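
The four probability values in Table 10 follow from the Nedelsky rule for a
four-option item: the probability of a correct response is the reciprocal of
the number of options that remain after a rater eliminates distractors. The
Python sketch below is illustrative only; the function name and the `cap`
parameter are assumptions, with the cap of 0.99 taken from the coding
convention described in the table footnote. It recomputes each rater's mean
from the rounded relative frequencies in Table 10.

```python
def nedelsky_probability(n_eliminated, n_options=4, cap=0.99):
    """Probability of a correct response after eliminating distractors."""
    if not 0 <= n_eliminated < n_options:
        raise ValueError("n_eliminated must be between 0 and n_options - 1")
    # Reciprocal of the number of remaining options, capped as in Table 10.
    return min(1.0 / (n_options - n_eliminated), cap)


# Probabilities for 0, 1, 2, and 3 eliminated distractors.
probabilities = [nedelsky_probability(k) for k in range(4)]

# Relative frequencies from Table 10, in the same order as `probabilities`.
relative_frequencies = {
    "r(1)": [0.06, 0.16, 0.40, 0.38],
    "r(2)": [0.06, 0.25, 0.52, 0.18],
    "r(3)": [0.07, 0.42, 0.43, 0.08],
    "r(4)": [0.05, 0.16, 0.57, 0.22],
    "r(5)": [0.00, 0.17, 0.60, 0.23],
}

rater_means = {
    rater: sum(f * p for f, p in zip(freqs, probabilities))
    for rater, freqs in relative_frequencies.items()
}

for rater, mean in rater_means.items():
    print(rater, round(mean, 3))
```

Because the relative frequencies are rounded to two decimal places, the
recomputed means agree with the reported rater means only approximately.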
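Table 11

Equations for the Brennan-Kane Index of Dependability
and a Modification

$$
\Phi(\lambda) = \frac{\sigma^2(p) + (\mu - \lambda)^2}
                     {\sigma^2(p) + (\mu - \lambda)^2 + \sigma^2(\Delta)}
\qquad (6)
$$

$$
\hat{\Phi}(\lambda) = \frac{\hat{\sigma}^2(p) + (\bar{Y} - \lambda)^2 - \hat{\sigma}^2(\bar{Y})}
                           {\hat{\sigma}^2(p) + (\bar{Y} - \lambda)^2 - \hat{\sigma}^2(\bar{Y}) + \hat{\sigma}^2(\Delta)}
\qquad (8)
$$

where

$\hat{\sigma}^2(p) = [MS(p) - MS(pi)]/n_i$
$\hat{\sigma}^2(i) = [MS(i) - MS(pi)]/n_p$
$\hat{\sigma}^2(pi) = MS(pi)$
$\hat{\sigma}^2(\bar{Y}) = \hat{\sigma}^2(p)/n_p + \hat{\sigma}^2(i)/n_i + \hat{\sigma}^2(pi)/n_p n_i$

$$
\hat{\Phi}(\bar{X}) = \frac{\hat{\sigma}^2(p) + (\bar{Y} - \bar{X})^2 - \hat{\sigma}^2(\bar{Y}) - \hat{\sigma}^2(\bar{X})}
                           {\hat{\sigma}^2(p) + (\bar{Y} - \bar{X})^2 - \hat{\sigma}^2(\bar{Y}) - \hat{\sigma}^2(\bar{X}) + \hat{\sigma}^2(\Delta)}
\qquad (11)
$$

$$
\hat{\Phi}(\bar{X}|R^*) = \frac{\hat{\sigma}^2(p) + (\bar{Y} - \bar{X})^2 - \hat{\sigma}^2(\bar{Y}) - \hat{\sigma}^2(\bar{X}|R^*)}
                               {\hat{\sigma}^2(p) + (\bar{Y} - \bar{X})^2 - \hat{\sigma}^2(\bar{Y}) - \hat{\sigma}^2(\bar{X}|R^*) + \hat{\sigma}^2(\Delta)}
\qquad (12)
$$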
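
To show how the modification changes the index, the following Python sketch
evaluates the estimator with and without the variance of the cutting score
subtracted from the signal term. The function name is an assumption, and every
numerical input below is hypothetical, chosen only for illustration; none of
the values is taken from the study.

```python
def dependability_index(var_p, mean_y, cut, var_ybar, var_delta, var_cut=0.0):
    """Estimated index of dependability for a given cutting score.

    var_cut = 0.0 treats the cutting score as a known constant.
    A positive var_cut subtracts the estimated variance of the cutting
    score from the signal term, as in the modified index.
    """
    signal = var_p + (mean_y - cut) ** 2 - var_ybar - var_cut
    return signal / (signal + var_delta)


# Hypothetical inputs, for illustration only.
inputs = dict(var_p=0.0100, mean_y=0.75, cut=0.66,
              var_ybar=0.0002, var_delta=0.0020)

phi_fixed_cut = dependability_index(**inputs)
phi_variable_cut = dependability_index(**inputs, var_cut=0.00033)

print(round(phi_fixed_cut, 4), round(phi_variable_cut, 4))
```

Since the signal term appears in both numerator and denominator, subtracting a
positive cutting score variance always lowers the index, and the reduction is
larger when the cutting score is less stable.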
[Figure 1: scatter plot of the five rater means, with points labeled by rater
number; the horizontal axis is labeled Nedelsky.]

Figure 1. Rater means (over items) for
probability of a correct response using
Nedelsky and Angoff procedures.
[Figure 2: scatter plot of the five rater standard deviations, with points
labeled by rater number; the horizontal axis is labeled Nedelsky.]

Figure 2. Standard deviations, for each rater,
of the probabilities assigned to items using
Nedelsky and Angoff procedures.
[Figure 3: frequency polygon with separate lines for the Nedelsky and Angoff
procedures; the horizontal axis is labeled Mean Probability Assigned to Item.]

Figure 3. Frequency polygon of the means (over
raters) of the probabilities assigned to items.
Frequencies are plotted for the midpoints of
intervals having width 0.10.
Figure 4. Frequency polygon of the standard
deviations (over raters) of the probabilities
assigned to items. Frequencies are plotted
for the midpoints of intervals having width
0.05.



